Adaptive indexing in modern database kernels
Physical design represents one of the hardest problems for database management systems. Without proper tuning, systems cannot achieve good performance. Offline indexing creates indexes a priori assuming good workload knowledge and idle time. More recently, online indexing monitors the workload trends and creates or drops indexes online. Adaptive indexing takes another step towards completely automating the tuning process of a database system, by enabling incremental and partial online indexing. The main idea is that physical design changes continuously, adaptively, partially, incrementally and on demand while processing queries as part of the execution operators. As such it brings a plethora of opportunities for rethinking and improving every single corner of database system design.
We will analyze the indexing space between offline, online, and adaptive indexing through several state-of-the-art indexing techniques, e.g., what-if analysis and soft indexes. We will discuss in detail adaptive indexing techniques such as database cracking, adaptive merging, sideways cracking, and various hybrids that try to balance the online tuning overhead with the convergence speed to optimal performance. In addition, we will discuss how various aspects of modern techniques for database architectures, such as vectorization, bulk processing, column-store execution, and storage, affect adaptive indexing. Finally, we will discuss several open research topics towards fully autonomous database kernels.
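The core cracking idea described above can be illustrated with a toy sketch (hypothetical and heavily simplified; a real kernel cracks a single array in place rather than keeping separate pieces): each range query physically reorganizes exactly the data it touches, so the column becomes incrementally more ordered as a side effect of query execution.

```python
class CrackedColumn:
    """Toy adaptive index: the column is kept as value-ordered pieces.
    Each range query splits only the pieces it touches, so the data
    self-organizes incrementally while queries are being answered."""

    def __init__(self, values):
        # start with one unsorted piece spanning the whole value domain
        self.pieces = [(float("-inf"), float("inf"), list(values))]

    def select(self, low, high):
        """Return values in [low, high); crack touched pieces as a side effect."""
        result, new_pieces = [], []
        for lo, hi, vals in self.pieces:
            if hi <= low or lo >= high:          # disjoint piece: left untouched
                new_pieces.append((lo, hi, vals))
                continue
            lt  = [v for v in vals if v < low]
            mid = [v for v in vals if low <= v < high]
            ge  = [v for v in vals if v >= high]
            for bounds, part in (((lo, low), lt), ((low, high), mid), ((high, hi), ge)):
                if part:
                    new_pieces.append((bounds[0], bounds[1], part))
            result.extend(mid)
        self.pieces = new_pieces
        return result

col = CrackedColumn([13, 4, 55, 9, 2, 42])
col.select(5, 50)   # cracks the single piece into three value ranges
```

A second query, e.g. `col.select(0, 5)`, now scans only the low piece and leaves the others untouched, which is the incremental benefit the abstract refers to.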
Self-organizing tuple reconstruction in column-stores
Column-stores gained popularity as a promising physical design alternative. Each attribute of a relation is physically stored as a separate column, allowing queries to load only the required attributes. The overhead incurred is on-the-fly tuple reconstruction for multi-attribute queries. Each tuple reconstruction is a join of two columns based on tuple IDs, making it a significant cost component. The ultimate physical design is to have multiple presorted copies of each base table such that tuples are already appropriately organized in multiple different orders across the various columns. This requires the ability to predict the workload, idle time to prepare, and infrequent updates. In this paper, we propose a novel design, partial sideways cracking, that minimizes the tuple reconstruction cost.
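The tuple reconstruction cost discussed above can be made concrete with a minimal sketch (hypothetical column names and data; real column-stores use packed arrays, not Python lists): columns are stored separately, row IDs are implicit positions, and any multi-attribute query must stitch values back together by position.

```python
# Two attributes of one relation, stored column-wise; the tuple ID is
# simply the array position shared by both columns.
name = ["ann", "bob", "eve", "dan"]
age  = [34, 28, 41, 25]

def reconstruct(ids, *columns):
    """On-the-fly tuple reconstruction: a positional join of the given
    columns on tuple IDs."""
    return [tuple(col[i] for col in columns) for i in ids]

# SELECT name, age WHERE age > 30:
# first select qualifying positions, then reconstruct the tuples.
matching_ids = [i for i, a in enumerate(age) if a > 30]
rows = reconstruct(matching_ids, name, age)
```

Every extra attribute in the select list adds another positional join, which is why reconstruction becomes a significant cost component for wide queries.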
Benchmarking adaptive indexing
Ideally, realizing the best physical design for the current and all subsequent workloads would impact neither performance nor storage usage. In reality, workloads and datasets can change dramatically over time, and index creation impacts the performance of concurrent user and system activity. We propose a framework that evaluates the key premise of adaptive indexing --- a new indexing paradigm where index creation and re-organization take place automatically and incrementally, as a side-effect of query execution. We focus on how the incremental costs and benefits of dynamic reorganization are distributed across the workload's lifetime. We believe that measuring the costs and utility of the stages of adaptation yields relevant metrics for evaluating new query processing paradigms and comparing them to traditional approaches.
Estimating the compression fraction of an index using sampling
Data compression techniques such as null suppression and dictionary compression are commonly used in today's database systems. In order to effectively leverage compression, it is necessary to have the ability to efficiently and accurately estimate the size of an index if it were to be compressed. Such an analysis is critical if automated physical design tools are to be extended to handle compression. Several database systems today provide estimators for this problem based on random sampling. While this approach is efficient, there is no previous work that analyses its accuracy. In this paper, we analyse the problem of estimating the compressed size of an index from the point of view of worst-case guarantees. We show that the simple estimator implemented by several database systems has several "good" cases even though the estimator itself is agnostic to the internals of the specific compression algorithm.
The naïve method of actually building and compressing the index in order to estimate its size, while highly accurate, is prohibitively inefficient. Thus, we need to be able to accurately estimate the compressed size of an index without incurring the cost of actually compressing it. This problem is challenging because the size of the compressed index can depend significantly on the data distribution as well as the compression technique used. This is in contrast with the estimation of the size of an uncompressed index in physical database design tools, which can be derived in a straightforward manner from the schema (which defines the size of the corresponding column) and the number of rows in the table.
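The sampling-based estimator discussed above can be sketched as follows (a hypothetical, simplified version: `zlib` stands in for the system's actual codec, and the scaling is the simplest possible extrapolation): compress a uniform random sample of the column and scale the observed compression fraction to the full data size.

```python
import random
import zlib

def estimate_compressed_size(column, sample_frac=0.1, seed=0):
    """Estimate an index's compressed size by compressing a random
    sample and extrapolating the compression fraction. The estimator
    is agnostic to the codec's internals, as in the abstract above."""
    rng = random.Random(seed)
    k = max(1, int(len(column) * sample_frac))
    sample = rng.sample(column, k)
    raw = "".join(map(str, sample)).encode()
    fraction = len(zlib.compress(raw)) / len(raw)   # compressed / uncompressed
    full_raw_size = len("".join(map(str, column)).encode())
    return fraction * full_raw_size
```

The accuracy of such an estimator depends on how representative the sample is of the full data distribution, which is exactly the worst-case question the paper analyses.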
Evaluating Conjunctive Triple Pattern Queries over Large Structured Overlay Networks
We study the problem of evaluating conjunctive queries composed of triple patterns over RDF data stored in distributed hash tables. Our goal is to develop algorithms that scale to large amounts of RDF data, distribute the query processing load evenly, and incur little network traffic. We present and evaluate two novel query processing algorithms with these possibly conflicting goals in mind. We discuss the various tradeoffs that occur in our setting through a detailed experimental evaluation of the proposed algorithms.
Optimal column layout for hybrid workloads (VLDB 2020 talk)
Data-intensive analytical applications need to support both efficient reads and writes. However, what is usually a good data layout for an update-heavy workload is not well-suited for a read-mostly one, and vice versa. Modern analytical data systems rely on columnar layouts and employ delta stores to inject new data and updates. We show that for hybrid workloads we can achieve close to one order of magnitude better performance by tailoring the column layout design to the data and query workload. Our approach navigates the possible design space of the physical layout: it organizes each column's data by determining the number of partitions, their corresponding sizes and ranges, and the amount of buffer space and how it is allocated. We frame these design decisions as an optimization problem that, given workload knowledge and performance requirements, provides an optimal physical layout for the workload at hand. To evaluate this work, we build an in-memory storage engine, Casper, and we show that it outperforms state-of-the-art data layouts of analytical systems for hybrid workloads. Casper delivers up to 2.32× higher throughput for update-intensive workloads and up to 2.14× higher throughput for hybrid workloads. We further show how to make data layout decisions robust to workload variation by carefully selecting the input of the optimization.
http://www.vldb.org/pvldb/vol12/p2393-athanassoulis.pdf
Published version
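The read/write tension described above can be illustrated with a toy cost model (entirely hypothetical numbers and cost functions, not Casper's actual model): finer partitioning lets reads touch fewer rows, but each update pays more partition-maintenance overhead, so the best partition count depends on the workload mix.

```python
def expected_cost(num_parts, n_rows, read_frac):
    """Toy per-operation cost: a read scans one partition of
    n_rows / num_parts rows; a write pays maintenance proportional
    to the number of partitions."""
    read_cost = n_rows / num_parts
    write_cost = num_parts
    return read_frac * read_cost + (1 - read_frac) * write_cost

def best_layout(n_rows, read_frac, max_parts=1024):
    """Pick the partition count minimizing expected cost for the mix."""
    return min(range(1, max_parts + 1),
               key=lambda p: expected_cost(p, n_rows, read_frac))
```

Under this model a read-heavy mix prefers many small partitions while a write-heavy mix prefers few large ones, which is the basic trade-off the optimization problem in the paper navigates (with a far richer design space, including partition ranges and buffer allocation).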
Enhanced Stream Processing in a DBMS Kernel
Continuous query processing has emerged as a promising query processing paradigm with numerous applications. A recent development is the need to handle both streaming queries and typical one-time queries in the same application. For example, data warehousing can greatly benefit from the integration of stream semantics, i.e., online analysis of incoming data and combination with existing data. This is especially useful to provide low latency in data-intensive analysis in big data warehouses that are augmented with new data on a daily basis.
However, state-of-the-art database technology cannot handle streams efficiently due to their "continuous" nature. At the same time, state-of-the-art stream technology is purely focused on stream applications. The research efforts are mostly geared towards the creation of specialized stream management systems built with a different philosophy than a DBMS. The drawback of this approach is the limited opportunities to exploit successful past data processing technology, e.g., query optimization techniques.
For this new problem we need to combine the best of both worlds. Here we take a completely different route by designing a stream engine on top of an existing relational database kernel. This includes reuse of both its storage/execution engine and its optimizer infrastructure. The major challenge then becomes the efficient support for specialized stream features. This paper focuses on incremental window-based processing, arguably the most crucial stream-specific requirement. In order to maintain and reuse the generic storage and execution model of the DBMS, we elevate the problem to the query plan level. Proper op
Exploiting the Power of Relational Databases for Efficient Stream Processing
Stream applications have gained significant popularity over the last years, leading to the development of specialized stream engines. These systems are designed from scratch with a different philosophy than today's database engines in order to cope with the requirements of stream applications. However, this means that they lack the power and sophisticated techniques of a full-fledged database system, which exploits techniques and algorithms accumulated over many years of database research.
In this paper, we take the opposite route and design a stream engine directly on top of a database kernel.
Incoming tuples are directly stored upon arrival in a new kind of system tables, called baskets.
A continuous query can then be evaluated over its relevant baskets as a typical one-time query
exploiting the power of the relational engine.
Once a tuple has been seen by all relevant queries/operators, it is dropped from its basket.
A basket can be the input to a single or multiple similar query plans.
Furthermore, a query plan can be split into multiple parts each one with its own
input/output baskets allowing for flexible load sharing query scheduling.
Contrary to traditional stream engines, which process one tuple at a time, this model allows batch processing of tuples, e.g., querying a basket only after enough tuples have arrived or after a time threshold has passed.
Furthermore, we are not restricted to process tuples in the order they arrive.
Instead, we can selectively pick tuples from a basket based on the query requirements exploiting
a novel query component, the basket expressions.
We investigate the opportunities and challenges that arise with such a direction and we show that it carries significant advantages.
We propose a complete architecture, the DataCell, which we implemented on top of an open-source column-oriented DBMS.
A detailed analysis and experimental evaluation of the core algorithms using both micro benchmarks and
the standard Linear Road benchmark demonstrate the potential of this new approach.
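The basket mechanism described above can be sketched in miniature (a hypothetical simplification: real baskets are system tables queried by relational plans, and basket expressions are far richer than a single predicate): incoming tuples accumulate in a basket, a continuous query fires as an ordinary one-time query once a batch threshold is met, and consumed tuples are dropped.

```python
from collections import deque

class Basket:
    """Toy DataCell-style basket: a relational holding area for
    incoming stream tuples, queried in batches rather than
    one tuple at a time."""

    def __init__(self):
        self.tuples = deque()

    def append(self, t):
        # incoming tuples are stored upon arrival
        self.tuples.append(t)

    def query(self, predicate, batch_size):
        """Evaluate a continuous query as a one-time query over the
        basket; fire only once batch_size tuples have accumulated."""
        if len(self.tuples) < batch_size:
            return None                      # threshold not reached yet
        batch = [self.tuples.popleft() for _ in range(batch_size)]
        # tuples seen by all relevant queries are dropped from the basket
        return [t for t in batch if predicate(t)]
```

A time-threshold trigger, multiple consumers per basket, and selective (out-of-order) consumption via basket expressions would layer on top of this same structure.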